Skip to content

fix(RHOAIENG-25120): remove kueue as mandatory in RayCluster #832

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

chipspeak
Copy link
Contributor

@chipspeak chipspeak commented May 29, 2025

Issue link

Jira Link

What changes have been made

Verification steps

  • With appropriate Kueue config (a cluster queue and local queue CR on your cluster) create a RayCluster with an invalid label. This should return the VAP error. In this instance, it's not our code blocking the creation but the VAP itself.
  • With the above but leaving local queue blank or set to an existing one, resource should create without issue.
  • With Kueue removed, passing an invalid local_queue will result in a message to user but no hard error.
  • With Kueue removed, RayCluster creation works as intended, no hard errors or blockages.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Copy link

codecov bot commented May 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.48%. Comparing base (68d1f6c) to head (859b562).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #832   +/-   ##
=======================================
  Coverage   92.48%   92.48%           
=======================================
  Files          24       24           
  Lines        1423     1424    +1     
=======================================
+ Hits         1316     1317    +1     
  Misses        107      107           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@chipspeak
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2025
@laurafitzgerald
Copy link
Contributor

The changes look good here @chipspeak would you mind adding verification steps, i'm happy to verify it. :)

Copy link
Contributor

@kryanbeane kryanbeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link
Contributor

openshift-ci bot commented Jun 3, 2025

@kryanbeane: changing LGTM is restricted to collaborators

In response to this:

lgtm!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chipspeak chipspeak mentioned this pull request Jun 4, 2025
4 tasks
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2025
@laurafitzgerald
Copy link
Contributor

laurafitzgerald commented Jun 9, 2025

Verification

In a jupyter notebook running on my machine
built from the is branch and installed via
%pip install ../../dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl --force-reinstall
I'm using python 3.11

Kueue Installed

Invalid Queue
Creating a Clusterconfig with a local_queue with a value that there isn't a local_queue with that name in the namespace. The output is
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration

Applying this cluster results in the following error

WARNING: RayCluster creation rejected by the validating admission policy. Please contact your administrator.
Unexpected API error encountered (Reason: Unprocessable Entity)
Response: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io \"raytest\" is forbidden: ValidatingAdmissionPolicy 'kueue-validating-admission-policy' with binding 'kueue-validating-admission-policy-binding' denied request: The label 'kueue.x-k8s.io/queue-name' is either missing or does not have a value set.","reason":"Invalid","details":{"name":"raytest","group":"ray.io","kind":"rayclusters","causes":[{"message":"ValidatingAdmissionPolicy 'kueue-validating-admission-policy' with binding 'kueue-validating-admission-policy-binding' denied request: The label 'kueue.x-k8s.io/queue-name' is either missing or does not have a value set."}]},"code":422}

Valid Queue

Creating a Clusterconfig with a local_queue with a value that there IS a. localqueue with that name in the namespace
output is
Yaml resources loaded for raytest

RayCluster is created with the label kueue.x-k8s.io/queue-name: example

Kueue Not Installed

kueue set to Removed in DSC
and localqueue crd removed from the cluster

output is

The requested resource could not be located.
Please verify the resource name and namespace.
Response: 404 page not found

local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration
Yaml resources loaded for other

I was able to clluster.up the resulting raytest cr.

WARNING: The up() function is planned for deprecation in favor of apply().
Ray Cluster: 'other' has successfully been created

Copy link
Contributor

@laurafitzgerald laurafitzgerald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chipspeak validation went fine. 👍
Just some small suggestions about wording to avoid k8s specific language. also the VAP is being replace so making it more generic will mean we wont need to change the wording going forward

Copy link
Contributor

openshift-ci bot commented Jun 9, 2025

@laurafitzgerald: changing LGTM is restricted to collaborators

In response to this:

Hi @chipspeak validation went fine. 👍
Just some small suggestions about wording to avoid k8s specific language. also the VAP is being replace so making it more generic will mean we wont need to change the wording going forward

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2025
@laurafitzgerald
Copy link
Contributor

/ok-to-test

Copy link
Contributor

openshift-ci bot commented Jun 9, 2025

@laurafitzgerald: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@laurafitzgerald
Copy link
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 10, 2025
@laurafitzgerald laurafitzgerald added the test-guided-notebooks Run PR check to verify Guided notebooks label Jun 10, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2025
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 10, 2025
@laurafitzgerald
Copy link
Contributor

/approve
/lgtm

Copy link
Contributor

openshift-ci bot commented Jun 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kryanbeane, laurafitzgerald

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kryanbeane,laurafitzgerald]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@laurafitzgerald
Copy link
Contributor

/override verify-ray_job_client

Copy link
Contributor

openshift-ci bot commented Jun 10, 2025

@laurafitzgerald: Overrode contexts on behalf of laurafitzgerald: verify-ray_job_client

In response to this:

/override verify-ray_job_client

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chipspeak
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 10, 2025
@laurafitzgerald
Copy link
Contributor

/override verify-0_basic_ray verify-1_cluster_job_client verify-2_basic_interactive verify-3_widget_example

Copy link
Contributor

openshift-ci bot commented Jun 10, 2025

@laurafitzgerald: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • verify-0_basic_ray
  • verify-1_cluster_job_client

Only the following failed contexts/checkruns were expected:

  • add-approve-lgtm-label
  • tide
  • verify-2_basic_interactive
  • verify-3_widget_example
  • verify-local_interactive

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

In response to this:

/override verify-0_basic_ray verify-1_cluster_job_client verify-2_basic_interactive verify-3_widget_example

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit e967fb7 into project-codeflare:main Jun 10, 2025
20 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. test-guided-notebooks Run PR check to verify Guided notebooks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants